Full text available: pdf(385.22 KB)



This tutorial surveys design methods for energy-efficient system-level design. We
consider electronic systems consisting of a hardware platform and software layers. We
consider the three major constituents of hardware that consume energy, namely
computation, communication, and storage units, and we review methods of reducing their
energy consumption. We also study models for analyzing the energy cost of software, and
methods for energy-efficient software design and compilation. This survey ...

A way-halting cache for low-energy high-performance systems 

Chuanjun Zhang, Frank Vahid, Jun Yang, Walid Najjar 

March 2005 ACM Transactions on Architecture and Code Optimization (TACO), Volume 2 

Issue 1 
Publisher: ACM Press 

Full text available: pdf(1.32 MB) Additional Information: full citation, abstract, references, index terms

Caches contribute to much of a microprocessor system's power and energy consumption. 
Numerous new cache architectures, such as phased, pseudo-set-associative, way 
predicting, reactive-associative, way-shutdown, way-concatenating, and highly- 
associative, are intended to reduce power and/or energy, but they all impose some 
performance overhead. We have developed a new cache architecture, called a way-halting 
cache, that reduces energy further than previously mentioned architectures, while 
imposing ... 
Full text available: pdf(236.33 KB)

Caches contribute to much of a microprocessor system's power and energy consumption. 
We have developed a new cache architecture, called a way-halting cache, that reduces 
energy while imposing no performance overhead. Our way-halting cache is a four-way 
set-associative cache that stores the four lowest-order bits of all ways' tags into a fully 
associative memory, which we call the halt tag array. The lookup in the halt tag array is 
done in parallel with, and is no slower than, the set-index decod ... 

Keywords: cache design, low power techniques 
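
The halt-tag filtering described in this abstract can be sketched as follows. This is a minimal illustrative model, not the authors' design: the class name `WayHaltingSet`, the per-set list layout, and the "ways activated" count used as a stand-in for energy are all assumptions made for the example. Only the four-way associativity and the four low-order halt bits come from the abstract.

```python
HALT_BITS = 4                      # from the abstract: four lowest-order tag bits
WAYS = 4                           # from the abstract: four-way set-associative
HALT_MASK = (1 << HALT_BITS) - 1


class WayHaltingSet:
    """Hypothetical model of one cache set with its halt tags."""

    def __init__(self, tags):
        if len(tags) != WAYS:
            raise ValueError("expected one tag (or None) per way")
        self.tags = list(tags)
        # Halt tag array: only the low-order bits of each way's tag.
        self.halt_tags = [None if t is None else t & HALT_MASK for t in self.tags]

    def lookup(self, tag):
        """Return (hit, ways_activated) for a full tag."""
        low = tag & HALT_MASK
        # Ways whose halt tag differs cannot hit, so they are halted.
        active = [w for w, h in enumerate(self.halt_tags) if h == low]
        # Only the remaining ways perform the full tag comparison.
        hit = any(self.tags[w] == tag for w in active)
        return hit, len(active)


cache_set = WayHaltingSet([0x1A3, 0x2B3, 0x3C5, None])
print(cache_set.lookup(0x1A3))  # two ways share low bits 0x3; one of them hits
print(cache_set.lookup(0x7F8))  # no halt tag matches, so no way is activated
```

In this model a lookup never misses a line that is present, because a full tag match implies a halt tag match; the saving comes from the ways that are skipped.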
Chuanjun Zhang
Publisher: IEEE Computer Society, ACM Press

Full text available: pdf(391.3Q KB) Additional Information: full citation, abstract, index terms

Level one cache normally resides on a processor's critical path, which determines the
clock frequency. Direct-mapped caches exhibit fast access time but poor hit rates
compared with same-sized set-associative caches due to nonuniform accesses to the
cache sets, which generate more conflict misses in some sets while other sets are
underutilized. We propose a technique to reduce the miss rate of direct-mapped caches
through balancing the accesses to cache sets. We increase the decoder length and th ...

5 A predictive decode filter cache for reducing power consumption in embedded
processors

Weiyu Tang, Arun Kejariwal, Alexander V. Veidenbaum, Alexandru Nicolau 
April 2007 ACM Transactions on Design Automation of Electronic Systems (TODAES), 

Volume 12 Issue 2 

Publisher: ACM Press 

Full text available: pdf(267.90 KB) Additional Information: full citation, abstract, references, index terms

With advances in semiconductor technology, power management has increasingly become 
a very important design constraint in processor design. In embedded processors, 
instruction fetch and decode consume more than 40% of processor power. This
calls for development of power minimization techniques for the fetch and decode stages of 
the processor pipeline. For this, filter cache has been proposed as an architectural 
extension for reducing the power consumption. A filter cache is placed betw ... 

Keywords: Cache, embedded processors, power optimization 
Peter Soderquist, Miriam Leeser

November 1997 Proceedings of the fifth ACM international conference on Multimedia 
September 2003 ACM Computing Surveys (CSUR), Volume 35 Issue 3
Publisher: ACM Press 

Full text available: pdf(443.89 KB) Additional Information: full citation, abstract, references, citings, index terms

Program code compression is an emerging research activity that is having an impact in 
several production areas such as networking and embedded systems. This is because the 
reduced-sized code can have a positive impact on network traffic and embedded system 
costs such as memory requirements and power consumption. Although code-size 
reduction is a relatively new research area, numerous publications already exist on it. The 
methods published usually have different motivations and a variety of appli ... 

Keywords: code compaction, code compression, method assessment, method evaluation 



8 Early load address resolution via register tracking 

Michael Bekerman, Adi Yoaz, Freddy Gabbay, Stephan Jourdan, Maxim Kalaev, Ronny Ronen 
annual international symposium on Computer architecture ISCA '00, Volume 28 Issue 2

Publisher: ACM 

Full text available: pdf(143.17 KB) Additional Information: full citation, abstract, references, cited by, index terms

Higher microprocessor frequencies accentuate the performance cost of memory accesses.
This is especially noticeable in Intel's IA32 architecture, where the lack of registers results
in an increased number of memory accesses. This paper presents a novel, non-speculative
technique that partially hides the increasing load-to-use latency by allowing the early
issue of load instructions. Early load address resolution relies on register tracking to safely
compute the addresses of memory refere ...

9 Compiler-driven cached code compression schemes for embedded ILP processors 

Sergei Y. Larin, Thomas M. Conte 

November 1999 Proceedings of the 32nd annual ACM/IEEE international symposium 

on Microarchitecture MICRO 32 

Publisher: IEEE Computer Society 

Full text available: pdf(1.24 MB) Additional Information: full citation, abstract, references, citings, index terms



During the last 15 years, embedded systems have grown in complexity and performance 
to rival desktop systems. The architectures of these systems present unique challenges to 
processor microarchitecture, including instruction encoding and instruction fetch 
processes. This paper presents new techniques for reducing embedded system code size 
without reducing functionality. This approach is to extract the pipeline decoder logic for an 
embedded VLIW processor in software at system develo ... 

10 Near-Optimal Precharging in High-Performance Nanoscale CMOS Caches

Se-Hyun Yang, Babak Falsafi 

December 2003 Proceedings of the 36th annual IEEE/ACM International Symposium 

on Microarchitecture MICRO 36 

Publisher: IEEE Computer Society 

Full text available: pdf(206.92 KB) Additional Information: full citation, abstract, citings, index terms

High-performance caches statically pull up the bit-lines in all cache subarrays to optimize
cache access latency. Unfortunately, such an architecture results in a significant waste of
energy in nanoscale CMOS implementations due to high leakage and bitline discharge
in the unaccessed subarrays. Recent research advocates bitline isolation to control
precharging of individual subarrays using bitline precharge devices. In this paper, we
carefully evaluate the energy and performance trade-offs of bitline iso ...

11 GPGPU: general purpose computation on graphics hardware

David Luebke, Mark Harris, Jens Kruger, Tim Purcell, Naga Govindaraju, Ian Buck, Cliff 
Woolley, Aaron Lefohn 

August 2004 ACM SIGGRAPH 2004 Course Notes SIGGRAPH '04 
Publisher: ACM Press 

Full text available: pdf(63.03 MB) Additional Information: full citation, abstract, citings





The graphics processor (GPU) on today's commodity video cards has evolved into an 
extremely powerful and flexible processor. The latest graphics architectures provide 
tremendous memory bandwidth and computational horsepower, with fully programmable 
vertex and pixel processing units that support vector operations up to full IEEE floating 
point precision. High level languages have emerged for graphics hardware, making this 
computational power accessible. Architecturally, GPUs are highly parallel s ... 

12 DSTRIDE: data-cache miss-address-based stride prefetching scheme for multimedia 
processors 

Hariprakash. G, Achutharaman. R, Amos R. Omondi 

January 2001 Australian Computer Science Communications , Proceedings of the 6th 

Australasian conference on Computer systems architecture ACSAC '01, 

Volume 23 Issue 4 

Publisher: IEEE Computer Society, IEEE Computer Society Press 
Full text available: pdf(928.14 KB) Additional Information: full citation, abstract, references




Prefetching reduces cache miss latency by moving data up in the memory hierarchy before
they are actually needed. Recent hardware-based stride prefetching techniques mostly 
rely on the processor pipeline information (e.g. program counter and branch prediction 
table) for prediction. Continuing developments in processor microarchitecture drastically 
change core pipeline design and require that existing hardware-based stride prefetching 
techniques be adapted to the evolving new processor architectures. ... 

13 Instruction fetch and control flow: Branch predictor guided instruction decoding

Oliverio J. Santana, Ayose Falcon, Alex Ramirez, Mateo Valero 
September 2006 Proceedings of the 15th international conference on Parallel 

architectures and compilation techniques PACT '06 
Publisher: ACM Press 

Full text available: pdf(369.16 KB) Additional Information: full citation, abstract, references, index terms





Fast instruction decoding is a challenge for the design of CISC microprocessors. A well- 
known solution to overcome this problem is using a trace cache. It stores and fetches 
already decoded instructions, avoiding the need for decoding them again. However, 
implementing a trace cache involves an important increase in the fetch architecture 
complexity. In this paper, we propose a novel decoding architecture that reduces the fetch 
engine implementation cost. Instead of using a special-purpose buffer ... 

Keywords: branch predictor, complexity-effective, instruction decoding 
Trace caches are used to help dynamic branch prediction make multiple predictions in a
cycle by embedding some of the predictions in the trace. In this work, we evaluate a trace 
cache that is capable of delivering a trace consisting of a variable number of instructions 
via a linked list mechanism. We evaluate several schemes in the context of an x86 
processor model that stores decoded instructions. By developing a new classification for 
trace cache accesses, we are able to target those misses t ... 

Keywords: branch prediction, optimization, trace cache, x86 
Energy consumption is a major concern in many embedded computing systems. Several
studies have shown that cache memories account for about 50% of the total energy
consumed in these systems. The performance of a given cache architecture is determined, 
to a large degree, by the behavior of the application executing on the architecture. 
Desktop systems have to accommodate a very wide range of applications and therefore 
the cache architecture is usually set by the manufacturer as a best compr ... 
Deep-submicron designs have to take care of process variation effects as variations in
critical process parameters result in large variations in access latencies of hardware 
components. This is severe in the case of memory components as minimum sized 
transistors are used in their design. 

In this work, by considering on-chip data caches, we study the effect of access latency 
variations on performance. We discuss performance losses due to the worst-case design, 
wherein the entire cache o ... 
This article presents Cool-Mem, a family of memory system architectures that integrate
conventional memory system mechanisms, energy-aware address translation, and 
compiler-enabled cache disambiguation techniques, to reduce energy consumption in 
general-purpose architectures. The solutions provided in this article leverage on interlayer 
tradeoffs between architecture, compiler, and operating system layers. Cool-Mem 
achieves power reduction by statically matching memory operations with energy-eff ... 
A Decoded INstruction Cache (DINC) serves as a buffer between the instruction decoder
and the other instruction-pipeline stages. In this paper we explain how techniques that
reduce the branch penalty based on such a cache can improve CPU performance. We
analyze the impact of some of the design parameters of DINCs on variable instruction-
length computers, e.g., CISC machines. Our study indicates that tuning the mapping
function of the instructions into the cache can improve the ...
Conventional processors use a fully-associative store queue (SQ) to implement store-load
forwarding. Associative search latency does not scale well to capacities and bandwidths 
required by wide-issue, large window processors. In this work, we improve SQ scalability 
by implementing store-load forwarding using speculative indexed access rather than 
associative search. Our design uses prediction to identify the single SQ entry from which 
each dynamic load is most likely to forward. When a load exec ... 